Regularized Latent Semantic Indexing: A New Approach to Large-Scale Topic Modeling

Authors

  • Quan Wang
  • Jun Xu
  • Hang Li
  • Nick Craswell
Abstract

Topic modeling provides a powerful way to analyze the content of a collection of documents. It has become a popular tool in research areas such as text mining, information retrieval, natural language processing, and other related fields. In real-world applications, however, the usefulness of topic modeling is limited due to scalability issues. Scaling to larger document collections via parallelization is an active area of research, but most solutions require drastic steps such as vastly reducing the input vocabulary. In this paper we introduce Regularized Latent Semantic Indexing (RLSI), including a batch version and an online version, referred to as batch RLSI and online RLSI. Batch RLSI and online RLSI are as effective as existing topic modeling techniques and can scale to larger datasets without reducing the input vocabulary. Moreover, online RLSI can be applied to streaming data and can capture the dynamic evolution of topics. Both versions of RLSI formalize topic modeling as the problem of minimizing a quadratic loss function regularized by the l1 and/or l2 norm. This formulation allows the learning process to be decomposed into multiple sub-optimization problems which can be solved in parallel, for example via MapReduce. In particular, we propose adopting the l1 norm on topics and the l2 norm on document representations, yielding a model with compact and readable topics that is useful for retrieval. In learning, batch RLSI processes all the documents in the collection as a whole, while online RLSI processes them one by one. We also prove the convergence of online RLSI learning. Relevance ranking experiments on three TREC datasets show that batch RLSI and online RLSI perform better than LSI, PLSI, LDA, and NMF, and the improvements are sometimes statistically significant. Experiments on a web dataset containing about 1.6 million documents and 7 million terms demonstrate a similar boost in performance.
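To make the formulation concrete, below is a minimal single-machine sketch of the batch RLSI idea as stated in the abstract: factor the term-document matrix D (M terms by N documents) into a term-topic matrix U and a topic-document matrix V by minimizing ||D - UV||_F^2 + lambda1 * sum_k ||u_k||_1 + lambda2 * sum_n ||v_n||_2^2, alternating between the two factors. This is an illustrative sketch under those assumptions, not the paper's parallel MapReduce implementation; the function name, variable names, and hyperparameter values here are ours, not the authors'.

import numpy as np

def batch_rlsi(D, K, lambda1=0.1, lambda2=0.1, iters=20, seed=0):
    """Sketch of batch RLSI: alternate between the l2-regularized
    document representations V (closed-form ridge regression) and the
    l1-regularized topics U (coordinate descent with soft-thresholding)."""
    M, N = D.shape
    rng = np.random.default_rng(seed)
    U = 0.01 * rng.standard_normal((M, K))
    for _ in range(iters):
        # V-step: each document d_n independently solves
        # min ||d_n - U v_n||^2 + lambda2 ||v_n||^2, giving
        # v_n = (U^T U + lambda2 I)^{-1} U^T d_n.
        V = np.linalg.solve(U.T @ U + lambda2 * np.eye(K), U.T @ D)
        # U-step: each term's row of U independently solves an
        # l1-regularized least-squares problem; one sweep of coordinate
        # descent over topics, vectorized across all terms at once.
        S = V @ V.T   # K x K Gram matrix
        R = D @ V.T   # M x K correlations
        for k in range(K):
            # Residual correlation for topic k, excluding its own contribution.
            r_k = R[:, k] - U @ S[:, k] + U[:, k] * S[k, k]
            U[:, k] = (np.sign(r_k) * np.maximum(np.abs(r_k) - lambda1 / 2, 0.0)
                       / max(S[k, k], 1e-12))
    return U, V

# Toy usage on random nonnegative "term-document" counts:
D = np.abs(np.random.default_rng(1).standard_normal((50, 30)))
U, V = batch_rlsi(D, K=5)
print(U.shape, V.shape)   # (50, 5) (5, 30)

Note that the V-step decomposes over documents and the U-step over terms; this per-column and per-row independence is what the abstract means by sub-optimization problems that can be solved in parallel, for example by mapping over documents or terms in MapReduce.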


Similar Resources

Automatic keyword extraction using Latent Dirichlet Allocation topic modeling: Similarity with golden standard and users' evaluation

Purpose: This study investigates automatic keyword extraction from the tables of contents of Persian e-books in the field of science using LDA topic modeling, evaluating the extracted keywords' similarity with a golden standard and users' viewpoints on the model keywords. Methodology: This is a mixed-methods text-mining study in which LDA topic modeling is used to extract keywords from the table of contents of sci...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...


Topic Modeling via Nonnegative Matrix Factorization on Probability Simplex

One important goal of document modeling is to extract a set of informative topics from a text corpus and produce a reduced representation of each document. In this paper, we propose a novel algorithm for this task based on nonnegative matrix factorization on a probability simplex. We further extend our algorithm by removing global and generic information to produce more diverse and specific top...


Dimensionality Reduction and Topic Modeling: From Latent Semantic Indexing to Latent Dirichlet Allocation and Beyond

The bag-of-words representation commonly used in text analysis can be analyzed very efficiently and retains a great deal of useful information, but it is also troublesome because the same thought can be expressed using many different terms or one term can have very different meanings. Dimension reduction can collapse together terms that have the same semantics, to identify and disambiguate term...


Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

Techniques such as probabilistic topic models and latent semantic indexing have been shown to be broadly useful for automatically extracting the topical or semantic content of documents, or more generally for dimension reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable...




Publication date: 2012